library(dotenv)
library(tidyverse)
library(ggplot2)
library(sf)
library(tidycensus)
library(stringr)
library(magrittr)
library(jsonlite)
library(httr)
library(tidymodels)
library(knitr)
library(kableExtra)
library(data.table)
library(patchwork)
library(vip)
library(Metrics)
# Clone the private github repository for this assignment using SSH link:
# git@github.com:stejamilla/final-project.git
# do not use scientific notation
options(scipen = 999)
Final Project
Data Science for Public Policy
About This Project
Urban environments tend to retain heat because they have less tree and vegetation cover, more impervious surfaces such as asphalt, and other features that distinguish them from non-urban environments. Using data on DC's landcover (Urban Tree Canopy by Census Block Group in 2020), we built and tested three different models to predict where hotspots are most likely to occur within the District. We also included several variables from the American Community Survey (ACS) to test whether demographic characteristics could be predictive of hotspots. Our analysis is at the block group level.
The main outcome variable is the average evening temperature, which is part of the DC landcover dataset. It was created by the Sustaining Urban Places Research Lab (SUPR) at Portland State University in 2018. While we attempted to create our own temperature data from remote sensing raster data collected by satellites such as MODIS, the publicly available data were at too coarse a resolution to capture temperature differences between block groups the way SUPR did. This data limitation, combined with the fact that the DC government, our intended end user, has employed this specific SUPR temperature variable within its own 2020 dataset, led us to use it as our key outcome variable. Notably, DC chose average evening temperatures (rather than morning or afternoon temperatures) for its landcover dataset. We believe this is because surface and air temperatures tend to diverge widely during the daytime but converge at night, potentially capturing a more consistent temperature experience. Additionally, built urban areas with a high concentration of impervious surfaces like roads and concrete cool much more slowly at night than non-built areas, heightening the temperature disparity and emphasizing the urban heat island effect of cities.
Setup
Loading packages:
Prepare the Data (Loading and Cleaning)
DC Land Cover Data
Load data from Open Data DC (landcover, ward geometry):
# Load Urban Tree Canopy by Census Block Group (2020) data
landcover <- read_sf(paste0("data/Urban_Tree_Canopy",
"_by_Census_Block_Group_in_2020.shp")) |>
rename(block_group = GEOID) |>
mutate(block_group = as.character(block_group)) |>
mutate(OBJECTID = as.character(OBJECTID)) |>
rename_all(tolower) |>
relocate(block_group, .after = objectid) |>
# Converting UHI to Fahrenheit
mutate(uhi = (uhi*(9/5) + 32))
# Load DC Ward data from Urban Tree Canopy by Ward (2020)
wards <- read_sf("data/Urban_Tree_Canopy_by_Ward_in_2020.shp") |>
rename_all(tolower) |>
select(ward) |>
  st_transform(crs = st_crs(landcover))
Categorize each block group by ward:
# Spatial join so that each block group is assigned a ward based on majority
# area of a block group within a ward
landcover <- st_join(landcover, wards, largest = TRUE)
# Map the overlapping wards & block groups
landcover |>
mutate(ward = as.character(ward)) |>
ggplot() +
geom_sf(aes(fill = ward)) +
theme_void() +
labs(title = "District of Columbia Census Block Groups, by Ward",
subtitle = "2020",
caption = paste0("Source: Census American Community Survey, 2016 - 2020",
"\nDC Open Data Urban Tree Canopy by Census Block",
" Group, 2020"),
       fill = "Ward")
If a ward intersected a block group, we assigned the block group to that ward. If a block group overlapped multiple wards, we assigned it to the ward containing the largest portion of the block group's area for the study.
Census Demographic Data
Load ACS 5-yr 2016-2020 variables:
# Load select Census ACS 5-yr 2016 - 2020 variables via API
# Secure Census API key (from .env file) for use in Census API call
load_dot_env()
credential <- Sys.getenv("census_api_key")
# Create url for API call
url <- str_glue(paste0("https://api.census.gov/data/2020/acs/acs5?get=",
# Description of data point
"NAME,",
# TOTAL POPULATION ESTIMATE
"B01003_001E,",
# TOTAL POPULATION BY RACE - WHITE ALONE
"B02001_002E,",
# TOTAL POPULATION BY RACE - BLACK OR AFR AMER ALONE
"B02001_003E,",
# TOTAL POPULATION BY RACE - AMER IND AND AK NAT ALONE
"B02001_004E,",
# TOTAL POPULATION BY RACE - ASIAN ALONE
"B02001_005E,",
# TOTAL POPULATION BY RACE - NAT HAW OR PAC ISL ALONE
"B02001_006E,",
# TOTAL POPULATION BY RACE - OTHER ALONE + TWO OR MORE RACES
"B02001_008E,",
# TOTAL HISPANIC OR LATINO
"B03003_003E,",
# TOTAL POP WITH DOCTORATE DEGREE
"B15003_025E,",
# TOTAL POP WITH PROFESSIONAL SCHOOL DEGREE
"B15003_024E,",
# TOTAL POP WITH MASTER'S DEGREE
"B15003_023E,",
# TOTAL POP WITH BACHELORS DEGREE
"B15003_022E,",
# TOTAL POP WITH ASSOCIATES DEGREE
"B15003_021E,",
# TOTAL POP WITH REGULAR HIGH SCHOOL DIPLOMA
"B15003_017E,",
# TOTAL POP WITH GED OR ALTERNATIVE CREDENTIAL
"B15003_018E,",
# MEDIAN HH INCOME IN PAST 12 MOS
"B19013_001E,",
# TOTAL HOUSING UNITS
"B25002_001E,",
# TOTAL OCCUPIED HOUSING UNITS
"B25002_002E,",
# TOTAL VACANT HOUSING UNITS
"B25002_003E,",
# MEDIAN PROPERTY VALUE
"B25077_001E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# TOTAL
"B25070_001E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# 30 TO 34.9 %
"B25070_007E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# 35 TO 39.9 %
"B25070_008E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# 40 TO 49.9 %
"B25070_009E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# 50 % or more
"B25070_010E",
# Limit to block-group-level data for DC
"&for=block%20group:*&in=state:11&in=county:001&in=tract:*"))
# Use url to request data from API
acs_json <- GET(url = url)
# Check call status
http_status(acs_json)
$category
[1] "Success"
$reason
[1] "OK"
$message
[1] "Success: (200) OK"
# Save API JSON response as text
acs_json <- content(acs_json, as = "text")
# Save JSON as character matrix
acs_matrix <- fromJSON(acs_json)
# Convert matrix to tibble
acs_data <- as_tibble(acs_matrix[2:nrow(acs_matrix), ],
.name_repair = "minimal")
# Add variable names to tibble
names(acs_data) <- acs_matrix[1, ] |>
  str_replace(" ", "_")
Label and clean selected ACS variables for the model, or for use in derived measure variables:
# Clean up ACS data
acs_data <- acs_data |>
# Meaningful variable names
rename(total_pop = B01003_001E,
race_white = B02001_002E,
race_black = B02001_003E,
race_ai_an = B02001_004E,
race_asian = B02001_005E,
race_nh_pi = B02001_006E,
race_oth_mult = B02001_008E,
race_eth_hisp = B03003_003E,
ed_doct = B15003_025E,
ed_prof_deg = B15003_024E,
ed_master = B15003_023E,
ed_bach = B15003_022E,
ed_assoc = B15003_021E,
ed_ged = B15003_018E,
ed_reg_hsd = B15003_017E,
med_hhi = B19013_001E,
med_prop_val = B25077_001E,
rent_burden_tot = B25070_001E,
rent_burden_30_34 = B25070_007E,
rent_burden_35_39 = B25070_008E,
rent_burden_40_49 = B25070_009E,
rent_burden_50plus = B25070_010E,
total_hous_units = B25002_001E,
vac_units = B25002_003E,
occ_units = B25002_002E,
block_group_temp = block_group) |>
# Create block_group column to use for combining this with the other datasets
mutate(block_group = paste(state,
county,
tract,
block_group_temp,
sep = "")) |>
# Convert Census variables to numeric
mutate_at(vars(total_pop:rent_burden_50plus), as.numeric) |>
rename_all(tolower) |>
relocate(block_group, .before = name)
# Convert rent burden variables into a percentage of rent burdened households,
# then drop original rent burden variables
acs_data <- acs_data |>
mutate(
rent_burden_sum =
rent_burden_30_34 +
rent_burden_35_39 +
rent_burden_40_49 +
rent_burden_50plus
) |>
mutate(
rent_burden_pct = (rent_burden_sum / rent_burden_tot)
) |>
select(-c(rent_burden_30_34,
rent_burden_35_39,
rent_burden_40_49,
rent_burden_50plus,
rent_burden_tot,
rent_burden_sum)
  )
When selecting ACS variables, we wanted to capture a range of block group characteristics. We included total population and its racial breakdown to test whether population size and demographics are important predictors of temperature. Educational attainment offers another lens on a block group's population, especially if people with similar attainment levels live close to each other. Median household income and rent burden are economic indicators that capture how residents' economic conditions could shape a block group. Lastly, the number of housing units, vacant units, occupied units, and median property value are economic characteristics of the block group itself (rather than its residents) that may contribute to its physical environment.
Data Imputation Part 1
A few of our Census variables (rent burden, median household income, median property value) are missing data for some block groups. This seems to be due to insufficient response rates in those block groups: the standard error of the estimate was larger in magnitude than the estimate itself, so the Census Bureau suppresses that estimate.
Nearby block groups are likely to share similar characteristics in terms of property values, median income, and rent burdens, since those tend to reflect neighborhood characteristics that exist at a larger geography than the block group. Therefore, we came up with the following imputation plan:
If data are missing at the block group level, the NA block group is assigned the mean value of the variable from neighboring block groups who share a census tract.
If data are still missing at the census tract level, use the mean for neighboring tracts in the corresponding ward. (This happens in the “Data Imputation Part 2” section below.)
We recognize that this method is imperfect. First, ward-level imputation is less precise since wards are larger geographies. Second, the method may give too much “weight” to block groups that share a Census tract with an NA block group. For example, if within a ward we had to impute data for a couple block groups based on tract-level averages and then imputed data for other block groups based on the ward-level average, then that means the ward-level average partially was influenced by other already-imputed values (using tract-level averages).
However, there were only a handful of instances of ward-level imputation for the first two variables:
Rent burden: 7 block groups imputed at the ward level
Median HHI: 5 block groups imputed at the ward level
And a fairly small number for property value (less than 5% of observations):
Median property value: 27 block groups imputed at the ward level
Another limitation is that the median household income and median property value variables are top- and bottom-coded, meaning that imputed values (based on averages) may be lower or higher than they should be due to the top/bottom coding.
Imputation Part 1: Replace missing data points with tract-level data where available.
# Create tract dataframe
tract_data <- acs_data |>
mutate(med_hhi = if_else(
med_hhi < 0, NA, med_hhi),
med_prop_val = if_else(
med_prop_val < 0, NA, med_prop_val)
) |>
group_by(tract) |>
summarize(
rb_tract = mean(rent_burden_pct, na.rm = TRUE),
hhi_tract = mean(med_hhi, na.rm = TRUE),
propval_tract = mean(med_prop_val, na.rm = TRUE)
)
# Replace missing data with tract data:
acs_data <- acs_data |>
left_join(tract_data, by = "tract") |>
# Rent burden
mutate(
# Create indicator for imputed vars
rb_imputed = if_else(
is.na(rent_burden_pct), 1, 0),
# Tract-level imputation
rent_burden_pct = if_else(
is.na(rent_burden_pct), rb_tract, rent_burden_pct)
) |>
# Median HHI
mutate(
# Create indicator for imputed vars
hhi_imputed = if_else(
med_hhi < 0, 1, 0),
# Tract-level imputation
med_hhi = if_else(
med_hhi < 0, hhi_tract, med_hhi)
) |>
# Median property value
mutate(
# Create indicator for imputed vars
propval_imputed = if_else(
med_prop_val < 0, 1, 0),
# Tract-level imputation
med_prop_val = if_else(
med_prop_val < 0, propval_tract, med_prop_val)
  )
Create the Final Dataset for Analysis
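The code that combines the landcover and ACS data into dc_full_data does not appear in this section. A minimal sketch, assuming a left join on the shared block_group key (the join type and ordering are our assumptions):

```r
# Combine landcover (sf) and ACS data by block group; keeping landcover on
# the left preserves the geometry column for mapping
dc_full_data <- landcover |>
  left_join(acs_data, by = "block_group")
```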
Determine Threshold for Hotspot vs. Not Hotspot
Decide the threshold for a hotspot:
# Visualize a cumulative histogram of counts of UHI values
landcover |>
  ggplot(aes(uhi)) +
  geom_histogram(aes(y = cumsum(after_stat(count)))) +
  labs(title = paste0("Distribution of Urban Heat Index (UHI) Variable,",
                      " District of Columbia"),
subtitle = "2020",
caption = paste0("\nSource: DC Open Data Urban Tree Canopy by Census",
"\nBlock Group, 2020, and Portland State University")) +
xlab("Urban Heat Index (UHI)") +
  ylab("Frequency")
# Calculate / visualize frequency & cumulative frequency
heat_freq <- landcover |>
count(uhi)
heat_freq$cumulative <- cumsum(heat_freq$n)
# Calculate the standard deviation and the mean of uhi
heat_freq |> summarize(
sd(uhi),
mean(uhi)
)
Simple feature collection with 1 feature and 2 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 389616.6 ymin: 124877.9 xmax: 407860.4 ymax: 147545.8
Projected CRS: NAD83 / Maryland
# A tibble: 1 × 3
`sd(uhi)` `mean(uhi)` geometry
<dbl> <dbl> <POLYGON [m]>
1 1.80 89.8 ((394387 136007.6, 394387.7 136010.7, 394388 136012.3, …
# Calculate how many block groups have one standard deviation above the mean
heat_freq |>
filter(uhi >= 91.6) |>
  count()
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 395375 ymin: 134605.6 xmax: 401893.4 ymax: 143306.4
Projected CRS: NAD83 / Maryland
# A tibble: 1 × 2
n geometry
* <int> <MULTIPOLYGON [m]>
1 82 (((395651.3 136857, 395651.4 136859.2, 395657.1 136859.3, 395664.3 1368…
The threshold for "hotspot" is a uhi (average evening temperature) of 91.6 degrees Fahrenheit, which is one standard deviation (1.8 degrees) above the mean. This means that 82 of the 571 block groups (~14%) would be classified as hotspots. We decided on this threshold with the belief that a hotspot should represent a block group whose average temperature is meaningfully higher than that of the majority of the sample; using the standard deviation as a benchmark seemed an appropriate way to naturally divide the data.
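Note that the summary above computes sd() and mean() over heat_freq, i.e., over the unique UHI values; computing them directly on the block-group data weights every block group equally. A sketch of deriving the threshold programmatically rather than hardcoding 91.6:

```r
# Hotspot threshold: mean + 1 standard deviation of block-group UHI
uhi_vals <- landcover$uhi
hotspot_threshold <- mean(uhi_vals, na.rm = TRUE) + sd(uhi_vals, na.rm = TRUE)

# Number of block groups at or above the threshold
sum(uhi_vals >= hotspot_threshold, na.rm = TRUE)
```

Keeping the threshold as a computed object means the cutoff stays in sync if the underlying data are revised.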
Data Imputation Part 2
Imputation part 2: for any data points still missing after imputing at tract level, we impute at the ward level.
# Create ward-level dataframe
ward_data <- dc_full_data |>
group_by(ward) |>
summarize(
rb_ward = mean(rent_burden_pct, na.rm = TRUE),
hhi_ward = mean(med_hhi, na.rm = TRUE),
propval_ward = mean(med_prop_val, na.rm = TRUE)
) |>
st_drop_geometry()
# Replace missing data with ward data:
# Rent burden
dc_full_data <- dc_full_data |>
left_join(ward_data, by = "ward") |>
mutate(
rent_burden_pct = if_else(
is.na(rb_tract), rb_ward, rent_burden_pct),
med_hhi = if_else(
is.na(hhi_tract), hhi_ward, med_hhi),
med_prop_val = if_else(
is.na(propval_tract), propval_ward, med_prop_val)
)
# Drop columns no longer needed
dc_full_data <- dc_full_data |>
select(-c(rb_tract, rb_ward,
hhi_tract, hhi_ward,
            propval_tract, propval_ward))
Finalize the Full Dataset
We create a variety of metrics using mutations of our imported data.
dc_full_data <- dc_full_data |>
mutate(
# convert total acres to sq miles
total_sq_mi = total_ac * 0.0015625,
# convert water acres to sq miles
water_sq_mi = wat_ac * 0.0015625,
# calculate water share of total sq miles
share_water_sq_mi = water_sq_mi / total_sq_mi,
# calculate population density per square mile
pop_dens_sq_mi = total_pop / total_sq_mi,
# calculate population density per water sq mile
pop_density_water_sq_mi = ifelse(
water_sq_mi !=0, total_pop / water_sq_mi, 0),
# calculate ratio of housing units to population
hous_units_per_person = ifelse(
total_hous_units != 0, total_hous_units / total_pop, 0),
# calculate ratio of vacant units to all units
vac_units_share = ifelse(
total_hous_units != 0, vac_units / total_hous_units, 0),
# calculate ratio of total income to median home value
inc_home_val_ratio = med_hhi / med_prop_val,
# create combined HS diploma or equivalent variable
ed_comb_hsd = ed_reg_hsd + ed_ged,
# create combined advanced post-undergraduate variable
ed_comb_adv_deg = ed_doct + ed_prof_deg + ed_master,
# calculate ratio of 25+ pop with HSD to total pop
ed_hsd_ratio = ifelse(
ed_comb_hsd !=0, ed_comb_hsd / total_pop, 0),
# calculate ratio of 25+ pop with bachelor's degree to total pop
ed_bach_ratio = ifelse(
ed_bach != 0, ed_bach / total_pop, 0),
# calculate ratio of 25+ pop with advanced post-bach degree to total pop
ed_adv_ratio = ifelse(
ed_comb_adv_deg != 0, ed_comb_adv_deg / total_pop, 0),
# calculate pop share white
race_white_ratio = ifelse(
race_white != 0, race_white / total_pop, 0),
# calculate pop share black or african american
race_black_ratio = ifelse(
race_black != 0, race_black / total_pop, 0),
# calculate pop share american indian or alaska native
race_ai_an_ratio = ifelse(
race_ai_an!= 0, race_ai_an / total_pop, 0),
# calculate pop share asian
race_asian_ratio = ifelse(
race_asian != 0, race_asian / total_pop, 0),
# calculate pop share native hawaiian or pacific islander
race_nh_pi_ratio = ifelse(
race_nh_pi != 0, race_nh_pi / total_pop, 0),
# calculate pop share two or more races
race_oth_mult_ratio = ifelse(
race_oth_mult != 0, race_oth_mult / total_pop, 0),
# calculate pop share hispanic or latino
race_eth_hisp_ratio = ifelse(
race_eth_hisp != 0, race_eth_hisp / total_pop, 0),
# calculate total non-white population
race_poc = race_black + race_ai_an + race_asian + race_nh_pi + race_oth_mult,
# calculate pop share non-white
race_poc_ratio = ifelse(
race_poc != 0, race_poc / total_pop, 0),
# creating hotspot var for decision tree model
hotspot = if_else(
uhi >= 91.6, "hotspot", "not hotspot"),
# remove factor from uhi var
uhi = as.numeric(as.character(uhi))) |>
relocate(c("block_group_temp", "tract"), .after = 2) |>
# remove columns with only one unique value
select(-objectid,
-name,
-state,
-statefp,
-county,
-countyfp,
-tractce,
-blkgrpce,
-namelsad,
-mtfcc,
-funcstat,
-field,
-shapearea,
-shapelen) |>
relocate(race_white_ratio:race_eth_hisp_ratio,
.after = race_eth_hisp) |>
relocate(ed_comb_hsd:ed_adv_ratio,
.after = ed_ged) |>
relocate(hous_units_per_person:inc_home_val_ratio,
.after = med_prop_val) |>
relocate(total_sq_mi:pop_density_water_sq_mi,
.after = inc_home_val_ratio) |>
  relocate(total_pop:hotspot, .after = tract)
Evaluating Imputation Via Visualization
Missing data have been imputed using the next smallest geography available. In order to assess the accuracy of these imputations, we plot the variables with imputed values against another variable with no imputed values, one which should have some correlation or general relation with the partially imputed variable. We can then see if the imputed values follow the general trend of the correlation or not.
Rent Burden imputation evaluation
Compared to the other partially-imputed variables, it was harder to find variables that were clearly correlated with rent burden (the percentage of households in a given tract spending 30% or more of their income on rent). However, to the extent that trends could be seen in visualizations of the data, our imputed values seem to follow those trends.
Rent burden vs. percent of population with a bachelor’s degree:
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ed_bach_ratio, y = rent_burden_pct,
color = as.character(rb_imputed)),
alpha = 0.4
) +
scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
labs(title = "Evaluating Rent Burden Imputation",
subtitle = "Comparison with % population with a Bachelor's degree",
color = "Imputed?") +
xlab("Proportion of Population with a Bachelor's Degree") +
ylab("Proportion of Households with Rent Burden") +
  theme_minimal()
There is no strong trend here, but no imputed data points seem to fall outside the general shape of the data.
Rent burden vs. ward:
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ward, y = rent_burden_pct,
color = as.character(rb_imputed)),
alpha = 0.4
) +
scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
labs(title = "Evaluating Rent Burden Imputation",
subtitle = "Comparison with ward",
color = "Imputed?") +
xlab("Ward") +
ylab("Proportion of Households with Rent Burden") +
  theme_minimal()
No imputed values fall outside the range for their ward. (Since a few values were imputed at the ward level itself, we would not rely on this check alone, but it provides additional reassurance alongside the other comparisons.)
Median household income imputation evaluation
Median household income vs. people of color ratio:
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = race_poc_ratio, y = med_hhi,
color = as.character(hhi_imputed)),
alpha = 0.4
) +
scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
labs(title = "Evaluating Median Household Income Imputation",
subtitle = "Comparison with POC % of population",
color = "Imputed?") +
xlab("Proportion of Population POC") +
ylab("Median Household Income") +
  theme_minimal()
There is a clear negative correlation between median household income and POC ratio, and the imputed data points follow it. One data point with race_poc_ratio = 0 has an imputed med_hhi that looks somewhat too low, but since it is a single point, we are not worried about systematic inaccuracy in the imputed values.
Median household income vs. percent of population with a bachelor’s degree:
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ed_bach_ratio, y = med_hhi,
color = as.character(hhi_imputed)),
alpha = 0.4
) +
scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
labs(title = "Evaluating Median Household Income Imputation",
subtitle = "Comparison with % of population with a bachelor's degree",
color = "Imputed?") +
xlab("Proportion of Population with a Bachelor's Degree") +
ylab("Median Household Income") +
  theme_minimal()
The correlation is weaker here, but again the imputed data points fit the overall trend. There is an outlier in ed_bach_ratio (= 1.00) that arguably should have a larger med_hhi, but it is just one data point.
Median property value imputation evaluation
Median property value vs. POC ratio:
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = race_poc_ratio, y = med_prop_val,
color = as.character(propval_imputed)),
alpha = 0.4
) +
scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
labs(title = "Evaluating Median Property Value Imputation",
subtitle = "Comparison with POC % of population",
color = "Imputed?") +
xlab("Proportion of Population POC") +
ylab("Median Property Value") +
  theme_minimal()
There is a clear negative relationship between median property value and race_poc_ratio, and the imputed data points follow this trend.
Median property value vs. percent of population with a bachelor’s degree:
# Median property value vs. bachelor's degree ratio
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ed_bach_ratio, y = med_prop_val,
color = as.character(propval_imputed)),
alpha = 0.4
) +
scale_color_manual(labels = c("No", "Yes"), values = c("#2271B5", "#8E133B")) +
labs(title = "Evaluating Median Property Value Imputation",
subtitle = "Comparison with % of population with a bachelor's degree",
color = "Imputed?") +
xlab("Proportion of Population with a Bachelor's Degree") +
ylab("Median Property Value") +
  theme_minimal()
The trend here is less clear, but there appears to be a positive relationship between med_prop_val and ed_bach_ratio, and the imputed data points follow it. Again, one potentially suspicious data point has ed_bach_ratio = 1.00 and an imputed med_prop_val that should perhaps be higher, but it is a single point.
After performing these analyses, we remove the imputation indicator variables:
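The removal code is not shown above; a minimal sketch, assuming the indicator names created in Data Imputation Part 1:

```r
# Drop the imputation indicators now that the visual checks are complete
dc_full_data <- dc_full_data |>
  select(-rb_imputed, -hhi_imputed, -propval_imputed)
```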
Prepare the Data for Analysis (Seed Setting & Splitting)
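The seed-setting and splitting code does not appear in this section. A minimal sketch using rsample (part of tidymodels, loaded above); the seed value and the 80/20 proportion are our assumptions, not the original's:

```r
# Set a seed so the split and resampling are reproducible
# (the seed value 506 is hypothetical)
set.seed(506)

# Split the full dataset 80/20, stratifying on hotspot to preserve the
# ~14% hotspot share in both the training and testing sets
dc_split <- initial_split(dc_full_data, prop = 0.8, strata = hotspot)
dc_train <- training(dc_split)
dc_test <- testing(dc_split)
```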
Conduct Exploratory Data Analysis
Review Distribution of Median Household Income
# Median household income, continuous
dc_train |>
ggplot() +
geom_histogram(aes(x = as.numeric(med_hhi))
) +
theme_minimal() +
labs(title = paste0("Median Household Income Distribution in the District of",
" Columbia"),
subtitle = "2020",
caption = "\nSource: Census American Community Survey, 2016 - 2020") +
xlab("Median Household Income") +
ylab("Frequency") +
  scale_x_continuous(labels = scales::label_dollar())
A check of the distribution of the median household income variable does not reveal any red flags about the validity of the data.
Explore Temperature in Relation to Other Variables
dc_full_data |> ggplot() +
geom_sf(
mapping = aes(fill = uhi),
color = "#400001"
) +
scale_fill_gradient(
low = "#FCF4B3",
high = "#940000"
) +
labs(
title = "Average Evening Temperature (2018)",
fill = "Fahrenheit",
caption = paste0("Source: DC Open Data Urban Tree Canopy by Census",
"\nBlock Group, 2020, and Portland State University")) +
  theme_void()
This map shows the average evening temperature of each block group. Notably, Rock Creek Park is relatively cooler than the rest of the city. While it was difficult to find documentation on exactly when SUPR logged temperatures, Tacoma, WA's version of this dataset on that city's open data portal states that temperatures were collected in July 2018, so we assume DC's temperatures were also taken that summer.
Temperature & population:
# Total population
total_pop_eda <- dc_train |>
ggplot(aes(x = total_pop, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "UHI and Total Population",
subtitle = "By Block Group",
caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
"\nDC Open Data Urban Tree\nCanopy by Census Block",
"\nGroup, 2020, and\nPortland State University"),
color = "Hotspot") +
xlab("\nTotal Population") +
ylab("UHI") +
scale_x_continuous(labels = scales::label_comma())
# Population density
pop_density_eda <- dc_train |>
ggplot(aes(x = pop_dens_sq_mi, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "UHI and Population Density",
subtitle = "By Block Group",
caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
"\nDC Open Data Urban Tree\nCanopy by Census Block",
"\nGroup, 2020, and\nPortland State University"),
color = "Hotspot") +
xlab("\nPopulation per Square Mile") +
ylab("UHI") +
scale_x_continuous(labels = scales::label_comma())
# Non-white population density
poc_pop_eda <- dc_train |>
ggplot(aes(x = race_poc_ratio, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "UHI and Non-White\nPopulation Density",
subtitle = "By Block Group",
caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
"\nDC Open Data Urban Tree\nCanopy by Census Block",
"\nGroup, 2020, and\nPortland State University"),
color = "Hotspot") +
xlab("\nNon-White Population Proportion") +
ylab("UHI")
total_pop_eda + pop_density_eda + poc_pop_eda
Population variables do not indicate a strong relationship with UHI/temperature.
Temperature & income/education:
# Median household income
mhi_eda <- dc_train |>
ggplot(aes(x = med_hhi, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "UHI and Median Household Income",
subtitle = "By Block Group",
caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
"\nDC Open Data Urban Tree\nCanopy by Census Block",
"\nGroup, 2020, and\nPortland State University"),
color = "Hotspot") +
xlab("Household Income") +
ylab("UHI") +
scale_x_continuous(labels = scales::label_comma())
# Ratio of bachelors degrees
bach_eda <- dc_train |>
ggplot(aes(x = ed_bach_ratio, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "UHI and College Degree Attainment",
subtitle = "By Block Group",
caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
"\nDC Open Data Urban Tree\nCanopy by Census Block",
"\nGroup, 2020, and\nPortland State University"),
color = "Hotspot") +
xlab("Proportion of population with a bachelor's degree") +
ylab("UHI") +
scale_x_continuous(labels = scales::label_comma())
mhi_eda + bach_eda
Similarly, variables indicating economic status and educational attainment do not show a strong relationship with UHI.
Temperature & landcover:
# Percentage area of urban tree canopy
utc_eda <- dc_train |>
ggplot(aes(x = utc_pct, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "UHI and Urban Tree Canopy",
subtitle = "By Block Group",
caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
"\nDC Open Data Urban Tree\nCanopy by Census Block",
"\nGroup, 2020, and\nPortland State University"),
color = "Hotspot") +
xlab("% Land Area Covered By Urban Tree Canopy") +
ylab("UHI")
# Percentage area of impervious surfaces
imp_eda <- dc_train |>
ggplot(aes(x = to_ia_pct, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "UHI and Impervious Surfaces",
subtitle = "By Block Group",
caption = paste0("Source: Census American\nCommunity Survey, 2016-2020;",
"\nDC Open Data Urban Tree\nCanopy by Census Block",
"\nGroup, 2020, and\nPortland State University"),
color = "Hotspot") +
xlab("% Land Area Covered By Impervious Surfaces") +
ylab("UHI")
utc_eda + imp_eda
Meanwhile, landcover variables show a potentially strong relationship with UHI. Generally, as urban tree canopy area increases, UHI decreases; conversely, as the percent area of impervious surfaces (e.g., roads and parking lots) increases, UHI also increases. This suggests that landcover variables may carry more importance in our models than demographic ones.
Preparation for Modeling
# Prep the dataset for modeling
dc_train <- dc_train |>
st_drop_geometry() |>
select(
# remove identifying vars
-block_group,
-intptlat,
-intptlon,
-gis_id,
    -globalid,
    -gid,
-tract,
-ward,
-block_group_temp,
# remove raw total pop var (custom pop density and other pop proportion
# vars remain)
-total_pop,
# remove raw race vars (custom proportion vars remain)
-race_white,
-race_black,
-race_ai_an,
-race_asian,
-race_nh_pi,
-race_oth_mult,
-race_eth_hisp,
# remove raw ed vars (custom proportion vars remain)
-ed_doct,
-ed_prof_deg,
-ed_master,
-ed_bach,
-ed_assoc,
-ed_reg_hsd,
-ed_ged,
-ed_comb_hsd,
-ed_comb_adv_deg,
# remove duplicative housing data (custom proportion vars and pct var
# remain)
-total_hous_units,
-occ_units,
-vac_units,
# remove duplicative land data (custom proportion / "pct" vars remain)
-ends_with("_ac"),
-total_sq_mi,
-water_sq_mi,
-wtr_pct,
-aland,
-awater,
-utcac0620,
-utcac1120,
-utcac1520,
-utcpct0620,
-utcpct1120,
-utc_pct_06,
-utc_pct_11,
-utc_pct_15,
# remove pop_density_water_sq_mi due to INF value for most records
-pop_density_water_sq_mi)
# V-fold cross validation with 10 folds
dc_folds <- vfold_cv(data = dc_train, v = 10)
Decision Tree
Modeling
dt_recipe <- recipe(formula = hotspot ~ ., data = dc_train) |>
# Remove uhi var to avoid duplication of data from hotspot
step_rm(uhi) |>
themis::step_downsample(hotspot) |>
step_nzv(all_numeric_predictors())
dt_model <- decision_tree() |>
set_engine(engine = "rpart") |>
set_mode(mode = "classification")
dt_workflow <- workflow() |>
add_recipe(dt_recipe) |>
add_model(dt_model)
dt_fitting <- dt_workflow |>
fit(data = dc_train)
rpart.plot::rpart.plot(x = dt_fitting$fit$fit$fit,
roundint = FALSE)The first decision point is whether the percentage of the block group that’s covered by tree canopy; if it’s greater or equal to 23% then it’s not a hotspot. If it is less than 23%, then you check how suitable the block group is for tree planting “based on a weighted formula that includes all planting prioritization categories.” If the score is less than 7, then it’s a not hotspot.
If the score is greater than or equal to seven, you check the percentage of total area covered by all combined impervious surfaces. If that percentage is less than 66%, the block group is not a hotspot. If it is greater than or equal to 66%, the final check is whether the number of housing units per person is less than 0.74. If it is 0.74 or greater, the block group is not a hotspot; if it is less than 0.74, it is a hotspot.
Interestingly, the decision tree focuses on landcover variables. This is consistent with the EDA we conducted, which suggests that the temperature/uhi data have a notable relationship with landcover variables but not demographic variables. The last decision was a surprise because it initially seems counter-intuitive that more housing per person would lead to a block group not being a hotspot. Perhaps this is because more housing units per person means that units are less densely occupied, which could contribute to cooler conditions.
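As a quick cross-check on which variables drive these splits, the fitted tree's variable importance can be plotted with vip(), mirroring the approach we use later for the linear model. A minimal sketch, assuming the fitted workflow dt_fitting from above:

```r
# Sketch: variable importance for the fitted decision tree.
# Assumes dt_fitting (the fitted workflow) exists in the session.
dt_fitting |>
  extract_fit_parsnip() |>   # pull the parsnip model out of the workflow
  vip(num_features = 10) +   # plot the ten most important variables
  labs(title = "Top Predictors in the Decision Tree")
```

If the importance ranking is dominated by the same landcover variables that appear in the plotted tree, that reinforces the interpretation above.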
Visualize Variables in the Decision Tree
canopy_map <- dc_full_data |> ggplot() +
geom_sf(aes(fill = utc_pct)) +
scale_fill_gradient(
low = "#fcbc32",
high = "#4e974b"
) +
theme_void() +
labs(
title = "Percentage of Tree Canopy",
subtitle = "By Block Group",
fill = "Percent Area",
caption = paste0("\nSource: DC Open Data\nUrban Tree Canopy by\nCensus Block",
" Group, 2020,\nand Portland State University\n\n"))
planting_map <- dc_full_data |> ggplot() +
geom_sf(aes(fill = overall)) +
scale_fill_gradient(
low = "#e2e2e2",
high = "#16501a"
) +
theme_void() +
labs(
title = "Tree Planting Suitability Score",
subtitle = "By Block Group",
fill = "Score",
caption = paste0("\nSource: DC Open Data\nUrban Tree Canopy by\nCensus Block",
" Group, 2020,\nand Portland State University\n\n"))
impervious_map <- dc_full_data |> ggplot() +
geom_sf(aes(fill = to_ia_pct)) +
scale_fill_gradient(
low = "#ffffff",
high = "#28292b"
) +
theme_void() +
labs(
title = "Percentage Covered By\nImpervious Surfaces",
subtitle = "By Block Group",
fill = "Percent Area",
caption = paste0("\nSource: DC Open Data\nUrban Tree Canopy by\nCensus Block",
" Group, 2020,\nand Portland State University"))
housing_map <- dc_full_data |> ggplot() +
geom_sf(aes(fill = hous_units_per_person)) +
scale_fill_gradient(
low = "#d86400",
high = "#0085d4"
) +
theme_void() +
labs(
title = "Housing Units Per Person",
subtitle = "By Block Group",
fill = "# Housing Units",
caption = paste0("Source: Census American\nCommunity Survey, 2016-2020"))
canopy_map + planting_map + impervious_map + housing_map
Evaluate the Model
Predict on the Testing Data
# Ensuring the testing dataset has the same variables as the training dataset
dc_test <- dc_test |>
st_drop_geometry() |>
select(
# remove identifying vars
-block_group,
-intptlat,
-intptlon,
-gis_id,
-globalid,
-gid,
-tract,
-ward,
-block_group_temp,
# remove raw total pop var (custom pop density and other pop proportion
# vars remain)
-total_pop,
# remove raw race vars (custom proportion vars remain)
-race_white,
-race_black,
-race_ai_an,
-race_asian,
-race_nh_pi,
-race_oth_mult,
-race_eth_hisp,
# remove raw ed vars (custom proportion vars remain)
-ed_doct,
-ed_prof_deg,
-ed_master,
-ed_bach,
-ed_assoc,
-ed_reg_hsd,
-ed_ged,
-ed_comb_hsd,
-ed_comb_adv_deg,
# remove duplicative housing data (custom proportion vars and pct var
# remain)
-total_hous_units,
-occ_units,
-vac_units,
# remove duplicative land data (custom proportion / "pct" vars remain)
-ends_with("_ac"),
-total_sq_mi,
-water_sq_mi,
-wtr_pct,
-aland,
-awater,
-utcac0620,
-utcac1120,
-utcac1520,
-utcpct0620,
-utcpct1120,
-utc_pct_06,
-utc_pct_11,
-utc_pct_15,
# remove pop_density_water_sq_mi due to INF value for most records
-pop_density_water_sq_mi)
# Creating predictions dataset
dt_predictions <- bind_cols(
dc_test,
predict(object = dt_fitting, new_data = dc_test),
predict(object = dt_fitting, new_data = dc_test, type = "prob")
)
# Editing column names so it's readable
names(dt_predictions) <- names(dt_predictions) |>
str_replace(" ", "_")
# Printing a few predictions
dt_preds <- dt_predictions |>
select(hotspot, .pred_class, .pred_hotspot, .pred_not_hotspot)
kable(dt_preds[1:20,])
| hotspot | .pred_class | .pred_hotspot | .pred_not_hotspot |
|---|---|---|---|
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.3750000 | 0.6250000 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.3750000 | 0.6250000 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.2631579 | 0.7368421 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
| not hotspot | not hotspot | 0.1162791 | 0.8837209 |
Evaluate with Performance Metrics
Create the Confusion Matrix
# Transforming the hotspot variable to become a factor variable
dt_predictions$hotspot <- as.factor(dt_predictions$hotspot)
# Creating confusion matrix
conf_mat(data = dt_predictions,
truth = hotspot,
estimate = .pred_class)
Truth
Prediction hotspot not hotspot
hotspot 16 13
not hotspot 1 86
Calculate accuracy, precision, and recall/sensitivity:
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.8793103 |
| .metric | .estimator | .estimate |
|---|---|---|
| precision | binary | 0.5517241 |
| .metric | .estimator | .estimate |
|---|---|---|
| recall | binary | 0.9411765 |
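The metric tables above can be reproduced with yardstick's metric functions; a minimal sketch, assuming dt_predictions from above with factor columns hotspot and .pred_class (yardstick treats the first factor level, "hotspot", as the event of interest):

```r
# Sketch: accuracy, precision, and recall for the decision tree predictions.
# Assumes dt_predictions exists with factor columns hotspot and .pred_class.
accuracy(data = dt_predictions, truth = hotspot, estimate = .pred_class)
precision(data = dt_predictions, truth = hotspot, estimate = .pred_class)
recall(data = dt_predictions, truth = hotspot, estimate = .pred_class)
```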
Accuracy and recall are relatively high at 87.9% and 94.1%, respectively. This means the model works relatively well at correctly classifying whether or not a block group is a hotspot, and specifically at identifying hotspots when a block group actually is a hotspot. Precision is not as high at 55.2%, since there were 13 instances in which the model predicted a block group was a hotspot when in reality it was not. One possible way to improve the model is to adjust the probability threshold used to determine whether a block group is a hotspot. Another is to re-examine which predictors are included. For example, we assumed that using the percentage-area versions of the landcover variables rather than the total-area versions (e.g., percentage of area covered by canopy rather than total canopy area) would allow consistency across block groups that vary in total area. It may be worth investigating the model's performance using the total-area variables instead of, or alongside, the percentage variables.
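One way to explore the threshold idea is to reclassify from the predicted probabilities and recompute the confusion matrix. A sketch, assuming dt_predictions from above; the 0.3 cutoff is illustrative only, not a recommended value:

```r
# Sketch: reclassify hotspots using a custom probability threshold (0.3 is
# illustrative only). Assumes dt_predictions with a .pred_hotspot column
# and a factor hotspot column.
dt_thresh <- dt_predictions |>
  mutate(.pred_class_30 = factor(
    if_else(.pred_hotspot >= 0.3, "hotspot", "not hotspot"),
    levels = levels(hotspot)))

# Recompute the confusion matrix at the new threshold
conf_mat(data = dt_thresh, truth = hotspot, estimate = .pred_class_30)
```

Lowering the threshold trades precision for recall (more block groups flagged as hotspots), so the right cutoff depends on whether false alarms or missed hotspots are costlier to the end-user.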
Linear Regression and Random Forest Models
Create a Recipe for the Regression and Random Forest Models
# Create recipe for use in linear regression and random forest models
dc_recipe <- recipe(formula = uhi ~ ., data = dc_train) |>
# remove duplicative hotspot
step_rm(hotspot) |>
# Removing predictors that have near zero variability
step_nzv(all_predictors()) |>
# Removing predictors that are highly correlated with others
step_corr(all_predictors())
Linear Regression
Modelling
# Linear regression model
dc_lm_model <-
linear_reg() |>
set_mode(mode = "regression") |>
set_engine(engine = "lm")
# Linear regression workflow
dc_lm_workflow <-
workflow() |>
add_recipe(recipe = dc_recipe) |>
add_model(spec = dc_lm_model)
# Setting up resamples for later evaluation
dc_lm_resamples <- dc_lm_workflow |>
fit_resamples(resamples = dc_folds)
Random Forests
Modelling
Hyperparameter Tuning
dc_rf_model <-
rand_forest(
trees = 200,
mtry = tune(),
min_n = tune()) |>
set_mode(mode = "regression") |>
set_engine(
engine = "ranger",
importance = "impurity",
num.threads = 4
)
# Create a parameter grid
dc_rf_grid <- grid_regular(mtry(range = c(1, 15)),
min_n(range = c(1, 15)),
levels = 5)
# Random forest workflow
dc_rf_workflow <-
workflow() |>
add_recipe(dc_recipe) |>
add_model(dc_rf_model)
# Tuning the grid
rf_resamples <- tune_grid(
dc_rf_workflow,
resamples = dc_folds,
grid = dc_rf_grid
)
# Showing the best hyperparameters based on model performance
kable(show_best(rf_resamples))
| mtry | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 15 | 8 | rmse | standard | 0.8899883 | 10 | 0.0338497 | Preprocessor1_Model15 |
| 15 | 1 | rmse | standard | 0.8913224 | 10 | 0.0317014 | Preprocessor1_Model05 |
| 8 | 4 | rmse | standard | 0.8925670 | 10 | 0.0324308 | Preprocessor1_Model08 |
| 11 | 1 | rmse | standard | 0.8927559 | 10 | 0.0329127 | Preprocessor1_Model04 |
| 8 | 1 | rmse | standard | 0.8928784 | 10 | 0.0314905 | Preprocessor1_Model03 |
Based on hyperparameter tuning in the most recent rendering of this document, the best random forest model uses mtry = 15 and min_n = 8. This means 15 predictors are randomly sampled as candidates at each split, and a node must contain at least 8 data points before it can be split further.
Implementing best hyperparameters based on tuning
# Filling in hyperparameters based on tuning
dc_rf_model <- rand_forest(
trees = 200,
mtry = 15,
min_n = 8
) |>
set_mode(mode = "regression") |>
set_engine(
engine = "ranger",
importance = "impurity",
num.threads = 4
)
# Random forest workflow
dc_rf_workflow <-
workflow() |>
add_recipe(dc_recipe) |>
add_model(dc_rf_model)
rf_resamples <- dc_rf_workflow |>
fit_resamples(resamples = dc_folds)
Compare the Linear Regression and Random Forest Models
Plot the RMSEs across each fit:
# Linear Regression: Calculate and plot the RMSE for each resample
kable(collect_metrics(
dc_lm_resamples,
summarize = FALSE) |>
filter(.metric == "rmse"))
| id | .metric | .estimator | .estimate | .config |
|---|---|---|---|---|
| Fold01 | rmse | standard | 0.0004106 | Preprocessor1_Model1 |
| Fold02 | rmse | standard | 0.0005899 | Preprocessor1_Model1 |
| Fold03 | rmse | standard | 0.2155494 | Preprocessor1_Model1 |
| Fold04 | rmse | standard | 0.0006500 | Preprocessor1_Model1 |
| Fold05 | rmse | standard | 0.0005780 | Preprocessor1_Model1 |
| Fold06 | rmse | standard | 0.0006337 | Preprocessor1_Model1 |
| Fold07 | rmse | standard | 0.0006282 | Preprocessor1_Model1 |
| Fold08 | rmse | standard | 0.1740280 | Preprocessor1_Model1 |
| Fold09 | rmse | standard | 0.0006290 | Preprocessor1_Model1 |
| Fold10 | rmse | standard | 0.0007083 | Preprocessor1_Model1 |
collect_metrics(
dc_lm_resamples,
summarize = FALSE) |>
filter(.metric == "rmse") |>
ggplot(mapping = aes(x = id, y = .estimate, group = .metric)) +
geom_point() +
geom_line() +
theme_void() +
labs(
title = "RMSE of Linear Regression Model Fits Across Folds") +
xlab("Fold") +
ylab("RMSE")
# Random Forests: Calculate and plot the RMSE for each resample
kable(rf_resamples |>
collect_metrics(summarize = FALSE) |>
filter(.metric == "rmse"))
| id | .metric | .estimator | .estimate | .config |
|---|---|---|---|---|
| Fold01 | rmse | standard | 0.7839339 | Preprocessor1_Model1 |
| Fold02 | rmse | standard | 0.9487524 | Preprocessor1_Model1 |
| Fold03 | rmse | standard | 0.8090248 | Preprocessor1_Model1 |
| Fold04 | rmse | standard | 0.8965515 | Preprocessor1_Model1 |
| Fold05 | rmse | standard | 0.8107504 | Preprocessor1_Model1 |
| Fold06 | rmse | standard | 0.7601461 | Preprocessor1_Model1 |
| Fold07 | rmse | standard | 1.0226176 | Preprocessor1_Model1 |
| Fold08 | rmse | standard | 0.8536072 | Preprocessor1_Model1 |
| Fold09 | rmse | standard | 1.0213467 | Preprocessor1_Model1 |
| Fold10 | rmse | standard | 1.0053431 | Preprocessor1_Model1 |
rf_resamples |>
collect_metrics(summarize = FALSE) |>
filter(.metric == "rmse") |>
ggplot(mapping = aes(x = id, y = .estimate, group = .metric)) +
geom_point() +
geom_line() +
theme_void() +
labs(
title = "RMSE of Random Forest Model Fits Across Folds") +
xlab("Fold") +
ylab("RMSE")
# Compare average rmse
kable(bind_rows(
"Linear regression" = collect_metrics(dc_lm_resamples) |>
filter(.metric == "rmse"),
'Random forest' = collect_metrics(rf_resamples) |>
filter(.metric == "rmse"),
.id = "model"))
| model | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| Linear regression | rmse | standard | 0.0394405 | 10 | 0.0260757 | Preprocessor1_Model1 |
| Random forest | rmse | standard | 0.8912074 | 10 | 0.0323003 | Preprocessor1_Model1 |
The linear regression model significantly outperforms the random forest model, based on a comparison of the average root mean squared error of each model.
Implement the Final Model
# Select the best resamples via select_best
lm_best <- dc_lm_resamples |>
select_best(metric = "rmse")
# Finalize the workflow with finalize_workflow
lm_final <- finalize_workflow(
dc_lm_workflow,
parameters = lm_best
)
# Fit the model on all the data
lm_final_fit <-
lm_final |>
last_fit(dc_split)
# Collect metrics for RMSE
kable(lm_final_fit |>
collect_metrics() |>
filter(.metric == "rmse"))
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 0.0006264 | Preprocessor1_Model1 |
# Make predictions with the testing data
lm_predictions <- bind_cols(
dc_test,
predict(object = extract_workflow(lm_final_fit),
new_data = dc_test)
)
# Display the predictions against the actual values
kable(select(lm_predictions, uhi, starts_with(".pred")))
| uhi | .pred |
|---|---|
| 88.46352 | 88.46327 |
| 84.07872 | 84.07895 |
| 89.18240 | 89.18221 |
| 90.83478 | 90.83565 |
| 91.49319 | 91.49289 |
| 87.98800 | 87.98787 |
| 90.91855 | 90.91851 |
| 91.22996 | 91.22987 |
| 86.75634 | 86.75600 |
| 91.28605 | 91.28467 |
| 88.21297 | 88.21425 |
| 89.02017 | 89.02014 |
| 90.88171 | 90.88170 |
| 91.24494 | 91.24481 |
| 91.11902 | 91.11910 |
| 86.09227 | 86.09214 |
| 90.41321 | 90.41312 |
| 88.43620 | 88.43472 |
| 88.88523 | 88.88529 |
| 87.96607 | 87.96593 |
| 91.79575 | 91.79579 |
| 87.62974 | 87.62993 |
| 87.31011 | 87.31014 |
| 89.10649 | 89.10658 |
| 85.47182 | 85.47092 |
| 88.43828 | 88.43847 |
| 91.14506 | 91.14470 |
| 87.59863 | 87.59818 |
| 91.16683 | 91.16666 |
| 91.55821 | 91.55804 |
| 91.65225 | 91.65212 |
| 90.45383 | 90.45434 |
| 89.13717 | 89.13717 |
| 92.78224 | 92.78133 |
| 91.34223 | 91.34092 |
| 90.43526 | 90.43529 |
| 90.55059 | 90.55063 |
| 88.10257 | 88.10274 |
| 88.57215 | 88.57233 |
| 87.30355 | 87.30255 |
| 92.88210 | 92.88055 |
| 93.01053 | 93.01037 |
| 91.20824 | 91.20828 |
| 92.33900 | 92.33915 |
| 91.44217 | 91.44088 |
| 91.31380 | 91.31251 |
| 90.08510 | 90.08529 |
| 90.31454 | 90.31472 |
| 90.34040 | 90.34050 |
| 92.71677 | 92.71653 |
| 91.12966 | 91.13089 |
| 91.70199 | 91.70101 |
| 91.27360 | 91.27245 |
| 91.12722 | 91.12699 |
| 89.90627 | 89.90664 |
| 91.11053 | 91.10946 |
| 91.91128 | 91.91105 |
| 90.12384 | 90.12395 |
| 89.17259 | 89.17246 |
| 92.50475 | 92.50473 |
| 90.61079 | 90.61050 |
| 90.95426 | 90.95404 |
| 90.72494 | 90.72457 |
| 90.45071 | 90.45214 |
| 92.04979 | 92.04977 |
| 93.05506 | 93.05488 |
| 92.57818 | 92.57796 |
| 90.46372 | 90.46361 |
| 90.89459 | 90.89450 |
| 90.60280 | 90.60261 |
| 90.94285 | 90.94233 |
| 92.94851 | 92.94808 |
| 90.06922 | 90.06777 |
| 92.66947 | 92.66875 |
| 92.17783 | 92.17794 |
| 91.43008 | 91.43021 |
| 90.58809 | 90.58815 |
| 89.93667 | 89.93793 |
| 91.61064 | 91.61079 |
| 89.73022 | 89.72999 |
| 90.49735 | 90.49598 |
| 90.67810 | 90.67779 |
| 89.64401 | 89.64401 |
| 90.98889 | 90.98766 |
| 90.37919 | 90.38026 |
| 90.13498 | 90.13510 |
| 90.02812 | 90.02827 |
| 87.61195 | 87.61294 |
| 90.43686 | 90.43564 |
| 89.14399 | 89.14393 |
| 90.03627 | 90.03631 |
| 90.26955 | 90.26962 |
| 89.64193 | 89.64204 |
| 89.54497 | 89.54488 |
| 90.85520 | 90.85521 |
| 90.12619 | 90.12615 |
| 89.77884 | 89.77882 |
| 90.04584 | 90.04449 |
| 89.17864 | 89.17837 |
| 88.87318 | 88.87332 |
| 88.78104 | 88.78113 |
| 89.06382 | 89.06405 |
| 89.34039 | 89.34039 |
| 87.23057 | 87.23009 |
| 87.73870 | 87.73839 |
| 86.04946 | 86.04936 |
| 86.38000 | 86.38147 |
| 87.32229 | 87.32257 |
| 88.43309 | 88.43256 |
| 86.42017 | 86.41998 |
| 86.83332 | 86.83465 |
| 87.54171 | 87.54164 |
| 88.54941 | 88.54935 |
| 87.90347 | 87.90457 |
| 88.84667 | 88.84678 |
| 88.56152 | 88.56154 |
# Interrogate which values were incorrectly predicted to the nearest tenth
kable(lm_predictions |>
select(uhi, .pred) |>
mutate(.pred = round(lm_predictions$.pred, 1),
uhi = round(lm_predictions$uhi, 1),
correct = if_else(.pred == uhi, "Yes", "No"
)) |>
count(correct))
| correct | n |
|---|---|
| Yes | 116 |
We were surprised to observe that the linear regression model predicted all 116 UHI values correctly when rounded to the nearest tenth. Before such a model can be implemented, further investigation of this extremely close estimation is needed. However, as we required the model to predict to more decimal places, its performance declined: it predicted 109 values (~94%) correctly to the nearest hundredth, and 73 values (~63%) correctly to the nearest thousandth.
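The hundredth- and thousandth-place counts above can be checked by repeating the rounding comparison at each precision. A sketch, assuming lm_predictions from above:

```r
# Sketch: count exact matches between predicted and actual UHI at several
# rounding precisions. Assumes lm_predictions with columns uhi and .pred.
for (digits in 1:3) {
  n_correct <- sum(round(lm_predictions$.pred, digits) ==
                     round(lm_predictions$uhi, digits))
  cat("Digits:", digits, "- correct:", n_correct, "\n")
}
```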
Visualize the Most Important Predictors
lm_final_fit |>
extract_fit_parsnip() |>
vip(num_features = 20) +
labs(title = "Top Predictors in Linear Regression Model") +
ylab("Importance") +
xlab("Variable")
Interestingly, the model did not identify any Census American Community Survey variables as key predictors of UHI in the District of Columbia. From this result, we concluded that demographic variables are only related to UHI through their relationships to the landcover variables. Of the 54 predictors fed into each model, the linear regression model identified seven key landcover variables as the most important predictors of UHI:
* OVERALL: The overall suitability for tree planting score of a given block group.
* WLDLF_CON: Available planting areas within 100’ of large canopy tracts (equal to or greater than 5 acres).
* ENRGY_CON: Residentially zoned areas with less than the district average tree cover and higher than average total possible planting area.
* LAND_OWN: Possible planting area on District owned operated or managed public & public/private land. Values are the percentage of plantable space within public land.
* POS_UTC: Positive urban tree canopy.
* STRM_WTR: Uses available planting area on, and adjacent to, impervious surfaces and surface water bodies to indicate areas where trees will more directly benefit stormwater runoff mitigation.
* TO_UN_PCT: The percentage of land area in the assessment boundary that is considered unsuitable for planting. This includes buildings and roads, barren soil and dry vegetation, as well as sports fields, golf course playing areas, airports, agricultural land, wetlands and any other land identified as unsuitable.
In practice, before implementing the model, we would need to explore the definitions of all landcover variables as well as their data collection context more closely with the District. Little information is available publicly on each variable beyond a set of basic definitions. Further, many of these top variables seem to be plausibly related to each other, even though we included step_corr() in our recipe. This warrants further investigation of how step_corr() removes correlated variables and the relationships among the remaining variables.
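One way to see exactly which predictors step_corr() dropped is to prep the recipe and tidy() that step. A sketch, assuming dc_recipe and dc_train from above, where step_corr() is the third step in the recipe:

```r
# Sketch: inspect which predictors step_corr() removed. Assumes dc_recipe
# (defined above), in which step_corr() is the third step.
dc_recipe_prepped <- prep(dc_recipe, training = dc_train)

# tidy() on a prepped step lists the variables that step eliminated
tidy(dc_recipe_prepped, number = 3)
```

Comparing that list against the top predictors above would show whether the remaining correlations stem from variables that fell just under step_corr()'s default threshold.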
Practical Use of the Model
We envision that the DC government could leverage the decision tree and/or linear regression models over the years as demographics and land use change over time. For example, if the government has updated data for the predictors in some block groups, it could run these models to see which block groups may face disproportionately high temperatures when summer arrives, and prioritize cooling projects in those areas during hot months. The models could also be used by local nonprofits that serve these communities, though they should do so in partnership with the DC government, which has more contextual information about the landcover variables.
If the DC government is interested in using or adapting this model, we have some recommendations for future improvement. One would be to use more recent temperature data before running the model on current landcover variables; we would hope the government can access more viable remote sensing data. It should also use more up-to-date Census demographic data. If the government looks to improve the models themselves, we would also recommend being more selective with the landcover variables. The public metadata offers little information about each variable, whereas the DC government likely has greater context about which variables may be duplicative and should be excluded.
Sources
Arizona State Climate Office. “Urban Heat Island.” Accessed December 17, 2024. https://globalfutures.asu.edu/azclimate/urban-heat-island/.
“SUPR Lab Research.” Accessed December 17, 2024. https://climatecope.research.pdx.edu/.
“Urban Heat Island/UHI Index 2018 (Portland State University) | Tacoma Open Data.” Accessed December 17, 2024. https://data.cityoftacoma.org/datasets/tacoma::urban-heat-island-uhi-index-2018-portland-state-university/about.
“Urban Tree Canopy by Census Block Group in 2020.” Vector digital data, April 5, 2022. https://opendata.dc.gov/datasets/DCGIS::urban-tree-canopy-by-census-block-group-in-2020/about.
US Census Bureau. “American Community Survey: 5-Year Estimates: Detailed Tables 5-Year.” Accessed December 17, 2024. https://api.census.gov/data/2020/acs/acs5.html.
US Census Bureau. “American Community Survey: Multiyear Accuracy of the Data (5-year 2016-2020).” Accessed December 17, 2024. https://www2.census.gov/programs-surveys/acs/tech_docs/accuracy/MultiyearACSAccuracyofData2020.pdf.
“Notes on ACS Estimate and Annotation Values.” Census.gov. Accessed December 17, 2024. https://www.census.gov/data/developers/data-sets/acs-1year/notes-on-acs-estimate-and-annotation-values.html.
US EPA, OAR. “What Are Heat Islands?” Overviews and Factsheets, June 17, 2014. https://www.epa.gov/heatislands/what-are-heat-islands.